DOCUMENT RESUME

ED 435 693                                        TM 030 324

AUTHOR          Verhelst, N. D.; Kaftandjieva, F.
TITLE           A Rational Method To Determine Cutoff Scores. Research
                Report 99-07.
INSTITUTION     Twente Univ., Enschede (Netherlands). Faculty of Educational
                Science and Technology.
PUB DATE        1999-00-00
NOTE            18p.
AVAILABLE FROM  Faculty of Educational Science and Technology, University of
                Twente, TO/OMD, P.O. Box 217, 7500 AE Enschede, The
                Netherlands.
PUB TYPE        Reports - Descriptive (141)
EDRS PRICE      MF01/PC01 Plus Postage.
DESCRIPTORS     *Cutting Scores; Data Collection; Foreign Countries; *Item
                Response Theory; *Performance Based Assessment; *Standards
IDENTIFIERS     *Experts; *Standard Setting



ABSTRACT 



A new method is proposed to set multiple standards in
performance tests. The method combines three sources of information coming
from three different data collections. The first is an empirical definition
of mastery of an item; the second consists of parameter estimates of the
items in an Item Response Theory (IRT) model, and the third source is a
collection of experts' judgments on the relation between item mastery and
level of performance. These judgments are given as an answer to very simple
questions. The method is not iterative, and the experts are not required to
judge borderline persons. The standard setting procedure is simple and can be
carried out without a computer. (Author/SLD)



A Rational Method to Determine Cutoff Scores

Research Report 99-07






N.D. Verhelst
Cito, Arnhem / University of Twente, Enschede

F. Kaftandjieva
University of Jyväskylä, Finland










Faculty of Educational Science and Technology
University of Twente

Department of Educational Measurement and Data Analysis






A Rational Method to Determine Cutoff Scores 



N.D. Verhelst
Cito, Arnhem / University of Twente, Enschede

F. Kaftandjieva
University of Jyväskylä, Finland






Abstract 

A new method is proposed to set multiple standards in performance tests. The method
combines three sources of information coming from three different data collections. The first is
an empirical definition of mastery of an item; the second consists of parameter estimates of the
items in an IRT model, and the third source is a collection of experts' judgments on the relation
between item mastery and the level of performance. These judgments are given as an answer
to very simple questions. The method is not iterative, and the experts are not required to judge
borderline persons. The standard setting procedure is simple and can be carried out without
a computer.

Introduction

In this paper a procedure is developed to find multiple cutoff points on a scale. The 
framework of the procedure can be described as follows. 

(1) The scale is described a priori in a number ($R$) of ordered levels, which are meant to cover
the whole range of the proficiency being measured. Each level is described in rather general
terms of performance.

(2) A number of items, larger than the number of categories, is constructed and administered to a
sample, hereafter called the calibration sample, and the responses are analyzed using a
unidimensional IRT-model. The assumptions of the IRT-model are tested in an appropriate way, and
possibly a number of items are discarded. It will be assumed in the sequel that the remaining
items comply in a satisfactory way with the IRT-model used. Therefore the items together
define a latent scale, and administration of the test to any person makes it possible to locate
this person with known accuracy on the latent scale. The scale values will be symbolized by
$\theta$. The number of items will be denoted $I$.

(3) $J$ ($> 1$) experts in the subject field are trained in order to induce a reasonably
homogeneous understanding of what is meant by the different levels of performance. The experts
do not know the testees, nor have they seen any tables or statistics with information on the
responses of the calibration sample.

(4) After training, the experts answer $I(R - 1)$ questions, phrased as: "Do you think a
person at level $r$ should be able to answer this item ($i$) correctly?" ($r = 2, \ldots, R$;
$i = 1, \ldots, I$). The experts have the text of the items at their disposal. The answers given
for previously presented levels are not available when the questions for a given level are
answered, so the experts cannot check their own consistency. The answers are binary (yes-no).
The experts answer the questions independently of each other, so there is no discussion to reach
agreement.

Notice the important differences with existing classical standard setting procedures (see Berk,
1986, for an overview). The targeted population is not well defined, so that normative elements
in determining the cutoff point do not enter the decision process. The main difference with
important classical methods, however, is that there is no need to imagine a borderline person as
in the methods of Angoff, Ebel and Nedelsky; reference is made (or intended to be made) to
the concept of the level as induced in the training of the experts. As will become clear in the
sequel, this vagueness can be a source of criticism of the procedure proposed, but it can also
enhance the validity of the procedure. In this sense it is similar to Jaeger's procedure (Cross et
al., 1984; Jaeger, 1993), but in contrast to that procedure, it is not iterative, and there is a slight
difference in the phrasing of the main question to the experts.

Although the method resembles the 'contrasting groups' method of Livingston and Zieky
(1982), there are important differences here as well. The experts in the present method do not
classify any person they know, whereas in the contrasting groups method a sample of persons
who have taken the test is classified. Moreover, in the latter method the judgment of the expert
concerns a decision about the person, whereas in the present method the judgment aims at
gathering information on the relation between typical persons and the items of the test.

The rationale for finding cutoff points rests on a comparison of the judgment of the experts
and the characteristics of the item as found in the calibration sample. There is, however, a third
component in the procedure which is essential, viz. the concept of mastery of an item. This will
be explained in the next section.



Mastery of an item 

The advantage of the instruction given to the experts is that it is free of any probabilistic
element. At the same time, however, it is not said explicitly what is meant by the phrase 'should
give a correct answer'. Does it mean a correct response always, under any circumstance, or does
it mean something like 'most of the time'? In other words, an explicit definition of mastery of
an item is not given or induced. As will be seen in the sequel, the lack of such a definition leaves
the result of the procedure arbitrary to a certain extent. To find a unique solution, a definition
of mastery has to be adopted.

A reasonable approach might be to define mastery in probabilistic terms: an item is 
mastered by a student if he has a probability of at least p to give a correct answer, where p is some 
number between zero and one. Of course, the outcome of the standard setting procedure will 
depend on the precise value of p, and therefore the choice of p should be founded on empirical 
evidence which reflects in some sense a widely accepted definition. 

Such an empirical procedure might go as follows. A panel of experts expresses its concept
of mastery for a test consisting of, say, 50 parallel items as a minimal proportion of correct
responses. The average reported proportion, provided there is not too much variability, can be
taken as a measure of mastery. The important point here, however, is that this procedure is
independent of the procedure described in the introductory section, and that it is not advisable
to mix the two. In the mastery definition procedure, the minimum requirement implicitly evokes
the idea of a borderline person, while in the data collection described in the introduction no
such reference is made.
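
The panel procedure for fixing $p$ can be sketched in a few lines of Python. This is only an illustration: the function name `mastery_probability`, the panel data, and the choice of the sample standard deviation as the measure of variability are assumptions, since the text does not say how much variability is "too much".

```python
from statistics import mean, stdev

def mastery_probability(min_correct, n_items=50):
    """Average the experts' minimal proportions correct and report their spread.

    min_correct: for each expert, the minimal number of correct responses
    (out of n_items parallel items) that the expert regards as mastery.
    """
    proportions = [c / n_items for c in min_correct]
    return mean(proportions), stdev(proportions)

# Hypothetical panel of four experts judging a 50-item parallel test.
p, spread = mastery_probability([40, 42, 38, 40])
```

A large `spread` would signal that the panel does not share a common notion of mastery, in which case the average is a doubtful basis for $p$.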

The loss function

To develop the argument further, it will be assumed that the number $p$ is fixed. As an
example consider the situation where the items are calibrated using the two parameter logistic
model (2PLM). The mastery level for item $i$ is defined to be the minimal value of $\theta$ such that
the probability of a correct answer is at least $p$; this value will be denoted $\kappa_i$, and it is seen that
it is the solution of the equation

$$\frac{\exp[\alpha_i(\kappa_i - \beta_i)]}{1 + \exp[\alpha_i(\kappa_i - \beta_i)]} = p, \tag{1}$$

where $\alpha_i$ and $\beta_i$ are the known discrimination and difficulty parameters respectively. The
solution is given by

$$\kappa_i = \beta_i + \frac{1}{\alpha_i} \ln\frac{p}{1 - p}. \tag{2}$$
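
As a numerical illustration of (2), the following Python sketch computes the mastery level of a 2PLM item and checks that the success probability at that point equals $p$. The function names and the parameter values are assumptions chosen for the example.

```python
import math

def mastery_level(alpha, beta, p):
    """Smallest theta at which a 2PLM item is answered correctly with probability >= p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    return beta + math.log(p / (1.0 - p)) / alpha

def prob_correct(theta, alpha, beta):
    """2PLM item response function, algebraically equal to the left side of (1)."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

# Illustrative values: discrimination 1.5, difficulty 0.2, mastery criterion p = 0.8.
kappa = mastery_level(alpha=1.5, beta=0.2, p=0.8)
```

For $p = 0.5$ the logarithm vanishes and the mastery level coincides with the difficulty parameter; for $p > 0.5$ it lies above the difficulty, by an amount that shrinks as the discrimination grows.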






To find a rational cutoff point on the $\theta$-axis which separates level $r + 1$ from level $r$
($r = 1, \ldots, R - 1$), only the expert judgments collected in the $(r + 1)$-th condition will be used.
The unknown boundary value between level $r$ and level $r + 1$ will be denoted $x_r$. Define the
binary variable $D_{ijr}$, with realizations $d_{ijr}$, as taking the value one if rater $j$ judges that a person
of level $r$ should give a correct response to item $i$, and zero otherwise.

The event $D_{ijr} = 1$ will be interpreted as the statement that all persons with a $\theta$-value not
smaller than $x_r$ master item $i$. But this implies, by the reasoning above, that the judge makes
a prediction about the relative positions of $x_r$ and $\kappa_i$, stating that $\kappa_i \le x_r$. Now, for some
choice of $x_r$ such that this relation does not hold, the prediction is wrong for all $\theta$ in the
interval $[x_r, \kappa_i)$, and a positive loss is given which is a non-decreasing function of the length
of the interval $(x_r, \kappa_i)$.

For the case $D_{ijr} = 0$, the interpretation is not so clear. The negation of the statement
given to the judge might mean: "it is not true that every student at level $r$ should master this
item", meaning that some do and some do not. It might also be interpreted as saying that nobody
at level $r$ masters the item. This latter interpretation, however, leads to problems in case $r = R$,
because it would imply that nobody can ever master that item. Therefore the former interpretation
will be used, yielding, in terms of the latent scale, the prediction that $\kappa_i > x_r$. If the reverse
relation holds, this prediction is wrong and a positive loss is associated with it, which is a
non-decreasing function of the length of the interval $(\kappa_i, x_r)$.

Since finding the cutoff value is accomplished independently for each level, the subscript
$r$ referring to the level will be dropped from now on. Summarizing, the loss function in its most
general form is defined as

$$L_{ij}(x) = D_{ij}\, g_1(\kappa_i - x)\, I(\kappa_i \ge x) + (1 - D_{ij})\, g_0(x - \kappa_i)\, I(\kappa_i \le x), \tag{3}$$

where $g_1(\cdot)$ and $g_0(\cdot)$ are non-decreasing continuous functions over the non-negative reals, and
$I(\cdot)$ is the indicator function, taking the value 1 if its argument is true and 0 otherwise. Because
a weak inequality is used in the indicator functions, it is reasonable to require that the equality
$g_1(0) = g_0(0)$ holds, and without loss of generality we can require that $g_1(0) = g_0(0) = 0$,
which ensures that the minimal loss is zero.

The overall loss function is defined as

$$L(x) = \sum_i \sum_j w_i v_j L_{ij}(x), \tag{4}$$

where $w_i$ and $v_j$ are fixed positive weights assigned to individual items and judges respectively.

Of course, there is a large number of possibilities in choosing the functions $g_1$ and $g_0$. A
first question is whether there is a reason to choose different functions, reflecting that
a wrong prediction should be penalized differently for the two outcomes of the variable $D_{ij}$.
In the present context there seems to be no reason for doing this, so we choose

$$g_1 = g_0 = g. \tag{5}$$

Further considerations may concern continuity and differentiability of $L_{ij}(x)$. A very simple
(and for many reasons attractive) function is given by

$$g(y) = \begin{cases} 0 & \text{if } y = 0, \\ 1 & \text{if } y > 0, \end{cases} \tag{6}$$

which makes $L_{ij}$ a step function, jumping from 1 to 0 at $\kappa_i$ (as $x$ increases) if $D_{ij} = 1$ and
the other way around if $D_{ij} = 0$. To construct a continuous function, we must ensure that
$\lim_{y \to 0} g(y) = 0$. A class of functions which fulfill this requirement is given by

$$g(y) = y^k, \quad k > 0. \tag{7}$$
If $k \le 1$, the function $L_{ij}$ is not differentiable at $\kappa_i$, but for $k > 1$ it is. So, choosing $k = 2$ yields
a differentiable loss function with all the desirable characteristics that squared loss functions
have in multivariate analysis. An important advantage will become clear in the sequel.

So, choosing $g_1 = g_0 = g$ and $g(y) = y^2$, (3) can be written as

$$L_{ij}(x) = \begin{cases} (\kappa_i - x)^2 & \text{if } D_{ij} = 1 \text{ and } \kappa_i > x, \\ (\kappa_i - x)^2 & \text{if } D_{ij} = 0 \text{ and } \kappa_i < x, \\ 0 & \text{otherwise.} \end{cases} \tag{8}$$

Of course, the definition of the overall loss function (4) remains the same. The optimal
value of $x$ is defined as that value that minimizes the overall loss function. Although (8) looks
simple enough, the minimization is not trivial: for given $x$ the value of $L_{ij}(x)$ can assume only
two different values. Which one applies, however, depends not only on the data $D$, but also on
the value of $x$ itself. Therefore, the loss function is not a simple quadratic function (with fixed
coefficients), but it has variable coefficients. The procedure to minimize this function will be
discussed in the next section.
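
To make the dependence on $x$ concrete, the quadratic loss (8) and the overall loss (4) can be evaluated directly. The sketch below is a minimal illustration: the function names, the data layout `d[i][j]` (judgment of judge $j$ on item $i$), the default unit weights, and the example values are all assumptions.

```python
def pair_loss(kappa_i, d_ij, x):
    """Quadratic loss (8) for one item/judge pair at candidate cutoff x."""
    if d_ij == 1 and kappa_i > x:
        return (kappa_i - x) ** 2
    if d_ij == 0 and kappa_i < x:
        return (kappa_i - x) ** 2
    return 0.0

def overall_loss(kappas, d, x, w=None, v=None):
    """Weighted sum (4) over items i and judges j; unit weights by default."""
    n_items = len(kappas)
    n_judges = len(d[0])
    w = [1.0] * n_items if w is None else w
    v = [1.0] * n_judges if v is None else v
    return sum(
        w[i] * v[j] * pair_loss(kappas[i], d[i][j], x)
        for i in range(n_items)
        for j in range(n_judges)
    )

# Hypothetical data: three items, two judges.
kappas = [-1.0, 0.0, 1.0]
d = [[1, 1], [1, 0], [0, 0]]
```

Which item/judge pairs contribute a non-zero term changes as $x$ crosses a mastery level, which is why a single closed-form quadratic minimum is not available.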



Minimizing the loss function 



Without loss of generality we can assume that the items are ordered in increasing order of
their $\kappa$-value. Now, consider the closed interval $[\kappa_g, \kappa_{g+1}]$ for some $1 \le g < I$. This interval
will be referred to as the $g$-th interval in the sequel. For all values of $x$ in this interval, the truth
value of the two conditions given in the right hand side of (8) cannot change for any of the $I$
items. Define the weights $f_i^{(g)}$ ($i = 1, \ldots, I$; $g = 1, \ldots, I - 1$) as

$$f_i^{(g)} = \begin{cases} w_i \sum_j v_j (1 - d_{ij}) & \text{if } i \le g, \\ w_i \sum_j v_j d_{ij} & \text{if } i > g. \end{cases} \tag{9}$$

One of the following events must occur: either all weights are zero, or there is at least one
positive weight. The latter case, which is the most interesting one, will be dealt with first.

The weight $f_i^{(g)}$ is the coefficient of the positive loss $(\kappa_i - x)^2$ in the overall loss function.
If the restriction that $x$ must be in the $g$-th interval is dropped, the overall loss function reduces
to a simple quadratic function whose (unique) minimum (recall the advantage of choosing
quadratic loss functions) is given by

$$y^{(g)} = \frac{\sum_i f_i^{(g)} \kappa_i}{\sum_i f_i^{(g)}}. \tag{10}$$
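
For readers who want the intermediate step behind (10), a short derivation follows. The symbol $Q^{(g)}$ for the unrestricted quadratic is introduced here only for this purpose and does not appear in the original text; at least one positive weight is assumed.

```latex
\begin{align*}
Q^{(g)}(x) &= \sum_{i=1}^{I} f_i^{(g)} (\kappa_i - x)^2, \\
\frac{dQ^{(g)}}{dx} &= -2 \sum_{i=1}^{I} f_i^{(g)} (\kappa_i - x) = 0
  \quad\Longrightarrow\quad
  x = \frac{\sum_i f_i^{(g)} \kappa_i}{\sum_i f_i^{(g)}} = y^{(g)}, \\
\frac{d^2 Q^{(g)}}{dx^2} &= 2 \sum_{i=1}^{I} f_i^{(g)} > 0 .
\end{align*}
```

The positive second derivative confirms that the stationary point is a minimum, and the first-order condition shows that it is the weighted average of the mastery levels.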
Restriction to the $g$-th interval gives immediately that the minimum of the loss function in this
interval is attained at

$$x^{(g)} = \begin{cases} \kappa_g & \text{if } y^{(g)} < \kappa_g, \\ \kappa_{g+1} & \text{if } y^{(g)} > \kappa_{g+1}, \\ y^{(g)} & \text{if } \kappa_g \le y^{(g)} \le \kappa_{g+1}. \end{cases} \tag{11}$$

To complete the search, the intervals $(-\infty, \kappa_1]$ and $[\kappa_I, +\infty)$ must also be investigated.
Consider the former interval, to be called the zero-th interval. Since $y^{(0)}$ is a weighted average
of the $\kappa$'s, it necessarily follows that $y^{(0)} \ge \kappa_1$, and therefore

$$x^{(0)} = \kappa_1.$$

By a similar reasoning, it holds that

$$x^{(I)} = \kappa_I.$$

So the minimum of the overall loss function is at $\hat{x}$, which is defined by

$$L(\hat{x}) = \min_g L\bigl(x^{(g)}\bigr), \tag{12}$$

where $g$ ranges from 1 to $I - 1$.
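
The search described by (9) to (12) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `minimize_loss`, the data layout `d[i][j]`, and the default unit weights are hypothetical, and when all weights in an interval are zero the sketch simply returns the left end point of that interval.

```python
def minimize_loss(kappas, d, w=None, v=None):
    """Return (x_hat, loss) minimizing the overall quadratic loss.

    kappas: mastery levels; d[i][j] in {0, 1}: judgment of judge j on item i.
    """
    n_items, n_judges = len(kappas), len(d[0])
    if n_items < 2:
        raise ValueError("at least two items are needed to form an interval")
    w = [1.0] * n_items if w is None else w
    v = [1.0] * n_judges if v is None else v
    # Sort items by increasing kappa, as the derivation assumes.
    order = sorted(range(n_items), key=lambda i: kappas[i])
    k = [kappas[i] for i in order]
    # Weighted totals of "no" and "yes" judgments per item.
    no = [w[i] * sum(v[j] * (1 - d[i][j]) for j in range(n_judges)) for i in order]
    yes = [w[i] * sum(v[j] * d[i][j] for j in range(n_judges)) for i in order]

    def loss(x):
        # Overall loss (4) with the quadratic pair loss (8).
        return sum(
            (no[i] if k[i] < x else yes[i] if k[i] > x else 0.0) * (k[i] - x) ** 2
            for i in range(n_items)
        )

    best = None
    for g in range(1, n_items):      # g-th interval is [k[g-1], k[g]] (0-based lists)
        f = no[:g] + yes[g:]         # weights (9)
        total = sum(f)
        if total > 0:
            y = sum(fi * ki for fi, ki in zip(f, k)) / total   # unrestricted minimum (10)
            x = min(max(y, k[g - 1]), k[g])                    # restriction (11)
        else:
            x = k[g - 1]             # zero loss on the whole interval
        candidate = (loss(x), x)
        if best is None or candidate[0] < best[0]:
            best = candidate         # running minimum over intervals (12)
    return best[1], best[0]
```

As a usage example with one hypothetical judge, `minimize_loss([0.0, 1.0, 3.0], [[1], [0], [1]])` returns `(2.0, 2.0)`: the judge's "no" on the middle item and "yes" on the hardest item pull the cutoff to the midpoint between their mastery levels.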

Of course this minimum exists, but it does not follow from the procedure described above
that it is unique. There might be two local minima, which in some cases could be equal. To show
that this is impossible, consider Figure 1, which shows the loss function in the neighborhood of a
minimum. The function is piecewise quadratic, and is continuously differentiable everywhere.
Therefore the loss function is a spline of degree two with knots at the $\kappa$-values. Notice that, since
the graph of the function in any interval coincides with a parabola with a minimum, and the
pieces join with a continuous derivative, the function is convex everywhere. Now, if there are
two local minima, there must be a (local) maximum in between, but this implies that the function
should be concave in some region, which is not possible. Therefore the minimum is unique.

[Insert Figure 1 about here]



Next, the case is considered where all weights are zero. To see how this can happen,
consider a hypothetical example with two judges, giving $d_{ij}$ values as shown in Table 1, where
the columns represent the items.

Table 1. Hypothetical data

Item       1   ...   i    i+1   ...   I
Judge 1    1   ...   1     0    ...   0
Judge 2    1   ...   1     0    ...   0



It is easy to see that for every $x$ in the $i$-th interval there are no mastery points ($\kappa$-values) to
the left of $x$ such that any of the judges stated that non-mastery of the corresponding item is
allowed, and similarly for mastery points to the right of $x$: mastery is not necessary according
to both judges. So there will be zero penalty for each of the judgments, and the loss function
attains its lower bound for all values in the $i$-th interval. If this interval has positive length,
it follows that the minimum is not unique.

Notice that since the items are ordered in increasing value of $\kappa$, and assuming that all
$\kappa$-values are different from each other, the case of zero loss can only occur if each judge gives a
Guttman pattern of responses (i.e. a 1 never follows a 0), and if all these Guttman patterns are
equal across judges. If there are ties in the $\kappa$-values, the definition of the Guttman pattern can
be relaxed slightly: a 1 must be assigned to all items whose $\kappa$ is less than the tied $\kappa$, and a 0 to
all items with a larger $\kappa$. If this happens for all judges, the minimum of the loss function is at
the tied $\kappa$, and of course is unique.

It will be clear from consideration of Table 1 that the interval with zero loss is unique. This
interval may also be $(-\infty, \kappa_1]$ or $[\kappa_I, +\infty)$, corresponding to the cases where for all judges
and all items $D_{ij} = 0$ or $D_{ij} = 1$ respectively.
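
The zero-loss case can be checked directly from the judgments. The sketch below restates the condition in an equivalent form, assuming strictly positive weights: a cutoff $x$ has zero loss exactly when no item with a "yes" from any judge has $\kappa_i > x$ and no item with a "no" from any judge has $\kappa_i < x$. The function name and the data layout `d[i][j]` are assumptions.

```python
import math

def zero_loss_interval(kappas, d):
    """Return (lower, upper) bounding the cutoffs with zero loss, or None if there are none.

    lower is the largest kappa among items that received at least one "yes";
    upper is the smallest kappa among items that received at least one "no".
    """
    lower = max((k for k, row in zip(kappas, d) if any(row)), default=-math.inf)
    upper = min((k for k, row in zip(kappas, d) if not all(row)), default=math.inf)
    return (lower, upper) if lower <= upper else None

# Two judges giving identical Guttman patterns, as in Table 1 (hypothetical kappas).
interval = zero_loss_interval([-1.0, 0.0, 1.0, 2.0], [[1, 1], [1, 1], [0, 0], [0, 0]])
```

An infinite bound corresponds to the two boundary cases in the text (all judgments 0 or all judgments 1), and `lower == upper` corresponds to the tied case with a unique minimum.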

Conclusion 

A procedure has been proposed to find multiple cutoff points on a latent continuum
which is defined by a number of items, the responses to which can be adequately modeled
by a unidimensional IRT-model. To apply the procedure, three kinds of information must be
available. These three are discussed briefly below.

The central concept in the whole procedure is mastery of an item. It is proposed to define
this concept as the minimal latent ability required to have at least a probability $p$ of a correct
answer. An experiment was described on how the value of $p$ might be determined. Once there
is agreement on this value, the mastery criterion $\kappa_i$ for each item is uniquely determined for all
(parametric) models with increasing item response functions. For models with non-monotone
item response functions, this issue may be problematic. As far as achievement or attainment
testing is concerned, monotone item response functions are the rule; in attitude measurement,
the concept of mastery is probably not adequate. This is a serious limitation of the method
proposed in this article.

Of course, to have an estimate of the $\kappa$-values, observations must be collected from a
calibration sample from the target population. This may be quite a difficult problem if the target
population is not known, as for example when the test is to be administered in the future via
the internet to whoever is interested in it. Updating the item parameter estimates and checking the
validity of the IRT-model at regular intervals as new data come in seems the only possible
way out of this problem.

Another aspect related to the calibration is the accuracy of the parameter estimates. In the
development above, the $\kappa$-values are treated as constants, but of course there is some estimation
error associated with them. Good practice may be to choose the weights $w_i$ in (4) inversely
proportional to the square of the standard error of the item parameter estimates (or even better,
to extend the procedure to incorporate the covariances as well).
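
The suggested choice of item weights can be sketched as below. The function name and the standard errors are hypothetical, and rescaling the weights to sum to one is only a convenience: multiplying all $w_i$ by the same positive constant does not move the minimum of (4).

```python
def inverse_variance_weights(standard_errors):
    """Item weights proportional to 1 / SE^2, rescaled to sum to one."""
    if any(se <= 0 for se in standard_errors):
        raise ValueError("standard errors must be positive")
    raw = [1.0 / se ** 2 for se in standard_errors]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical standard errors for two items: the more precisely
# estimated item receives four times the weight of the other.
weights = inverse_variance_weights([0.1, 0.2])
```

This sketch ignores the covariances between item parameter estimates, which the text mentions as a further refinement.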

The third kind of information is the expert judgment on mastery or non-mastery of the
items by a rather vaguely described person of a well described level of proficiency. The vague
description is used to let the expert make full use of his own representation of the intended level
of proficiency, without further qualification such as 'borderline' persons or even some concrete
persons he happens to know. At the same time, asking for probabilistic statements is avoided,
which may be especially attractive for experts in domains with little mathematical thinking
and experience, such as language testing. But even teachers of mathematics find
the yes/no method '(...) clearer and easier to use than the more traditional Angoff probability
estimation procedure' (Impara and Plake, 1997). Of course, one might criticize the procedure
inasmuch as it suggests that deterministic statements are asked for, like 'a person of level A
should always answer this item correctly', meaning that a single error on many similar items would
imply that this person cannot be of that level. But this would contradict common sense and
educational practice, where it is not expected that the most excellent student should obtain the
maximum score on all the examinations to deserve the highest degree.

The proposed method integrates these three kinds of information in an easy way to arrive
at a rational determination of the multiple cutoff points. Of course the validity of the whole
procedure depends on the validity of the constituent parts. Not only must the calibration results
be used to test the validity of the IRT-model; the validity of the expert judgments is also at stake,
both in the definition of the mastery level and in the phase of determining cutoff points. In
the former case, there must not be too large a variation in the probability statements; in the latter
procedure, not too many deviations from Guttman patterns (see Table 1) must occur, and if only
Guttman patterns occur (perfect intra-rater consistency), the location where the ones switch to
zeros must not differ too much across judges (inter-rater consistency). If these requirements are
violated to a large degree, the procedure to determine cutoff points may still be applied, possibly
with different weights assigned to the judges, but the validity of the results may be questionable.

Figure 1. The loss function and the parabola associated with the leftmost interval